Question 1

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?
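
The f-string mechanics this question depends on can be shown with a minimal plain-Python sketch. On Databricks the variable would typically be populated from the job parameter (for example via dbutils.widgets.get("date")); the literal value below is hypothetical.

```python
# Minimal sketch of the f-string interpolation the notebook relies on.
# The variable must exist before the f-string is evaluated; on Databricks
# it would come from the Jobs API parameter (value here is hypothetical).
date = "2023-01-01"

# The braces in an f-string substitute the variable's value into the path.
path = f"/mnt/source/{date}"
print(path)  # /mnt/source/2023-01-01
```

Note that writing `(date)` instead of `{date}` would produce the literal string "/mnt/source/(date)" with no substitution.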


Question 2

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day. Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?


Question 3

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?


Question 4

Question 4 Image 1

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings. The below query is used to create the alert. The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute. If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?
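
The alert's aggregation semantics can be sketched in plain Python: a per-sensor mean over the recent window, followed by the threshold check the alert evaluates on each refresh. Table and column names come from the question; the rows below are hypothetical.

```python
# Plain-Python sketch of the alert's logic: mean temperature per sensor
# over the most recent 5-minute window, then the threshold check.
recent_sensor_recordings = [  # hypothetical (sensor_id, temperature) rows
    ("s1", 125.0), ("s1", 118.0),  # sensor s1 averages above 120
    ("s2", 90.0), ("s2", 95.0),    # sensor s2 stays well below
]

readings = {}
for sensor_id, temp in recent_sensor_recordings:
    readings.setdefault(sensor_id, []).append(temp)
means = {sid: sum(ts) / len(ts) for sid, ts in readings.items()}

# On each refresh the alert fires if mean(temperature) > 120 holds for the
# current window; it stops firing once a refresh no longer satisfies it.
alert_fires = any(m > 120 for m in means.values())
print(means, alert_fires)
```

Because the window only covers the most recent 5 minutes, each refresh re-evaluates the condition against whatever recordings currently fall in that window.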


Question 5

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown. Which approach will allow this developer to review the current logic for this notebook?


Question 6

Question 6 Image 1

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database. After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged). Which statement describes what will happen when the above code is executed?


Question 7

Question 7 Image 1

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE". The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day. Which code block accomplishes this task while minimizing potential compute costs?


Question 8

Question 8 Image 1

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable. Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order. If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?


Question 9

Question 9 Image 1

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table. Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales. Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?


Question 10

A Delta table of weather records is partitioned by date and has the below schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?
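
For intuition, Delta's data skipping can be sketched in plain Python: the transaction log records per-file min/max statistics for columns, and a file whose statistics prove no row can match the predicate is never opened. The file names and statistics below are hypothetical.

```python
# Hypothetical per-file min/max statistics for `latitude`, mimicking what
# the Delta transaction log records for each data file.
file_stats = {
    "part-0001.parquet": {"min": -10.0, "max": 40.0},  # cannot match
    "part-0002.parquet": {"min": 50.0, "max": 70.0},   # may contain matches
    "part-0003.parquet": {"min": 67.0, "max": 82.0},   # may contain matches
}

# For the predicate `latitude > 66.3`, only files whose max latitude
# exceeds the threshold can possibly hold matching rows; the rest are
# skipped without being read.
files_to_scan = [name for name, s in file_stats.items() if s["max"] > 66.3]
print(files_to_scan)
```

Since the table is partitioned by date, not latitude, partition pruning does not help here; the column statistics are what narrow the scan.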


Question 11

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings. The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization. The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data. Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
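
The timing at the heart of this question can be sketched in plain Python, assuming the documented 7-day default retention threshold for VACUUM (the question states default table settings apply); the specific dates below are hypothetical.

```python
from datetime import datetime, timedelta

# Default VACUUM retention threshold for Delta tables (documented default).
RETENTION = timedelta(days=7)

delete_job = datetime(2024, 6, 2, 1, 0)  # Sunday 1am delete batch
vacuum_job = datetime(2024, 6, 3, 3, 0)  # Monday 3am VACUUM run

# VACUUM only removes data files that stopped being part of the current
# table version more than RETENTION ago; files rewritten by Sunday's
# delete job are only ~26 hours old when Monday's VACUUM runs.
age_at_vacuum = vacuum_job - delete_job
removable = age_at_vacuum > RETENTION
print(age_at_vacuum, removable)
```

Under the default threshold, the files holding the deleted values survive Monday's VACUUM and remain reachable via time travel until a later VACUUM removes them.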


Question 12

Question 12 Image 1

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create. Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?


Question 13

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id. For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour. Which solution meets these requirements?


Question 14

An hourly batch job is configured to ingest data files from a cloud object storage container, where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id. Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?
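
For intuition only, the Type 1 pattern being described can be sketched in plain Python as a keyed upsert: de-duplicate the hourly batch down to the most recent change per user_id, then overwrite matching keys (in Databricks this would typically be expressed as a MERGE INTO keyed on user_id). All rows below are hypothetical.

```python
# Plain-Python sketch of a Type 1 upsert: keep only the latest row per
# user_id, then upsert into the current-state table.
account_current = {1: {"username": "ada", "last_updated": 100}}

hourly_batch = [  # hypothetical new records from account_history
    {"user_id": 1, "username": "ada_l", "last_updated": 200},
    {"user_id": 2, "username": "grace", "last_updated": 150},
    {"user_id": 1, "username": "ada_lovelace", "last_updated": 300},
]

# First, reduce the batch to the most recent change per key, so each key
# is written at most once.
latest = {}
for row in hourly_batch:
    key = row["user_id"]
    if key not in latest or row["last_updated"] > latest[key]["last_updated"]:
        latest[key] = row

# Then upsert: matched keys are overwritten, new keys are inserted.
for key, row in latest.items():
    account_current[key] = {"username": row["username"],
                            "last_updated": row["last_updated"]}
print(account_current)
```

The de-duplication step matters because a user may change several times within one hour, and a merge must see at most one source row per key.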


Question 15

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours. Which approach would simplify the identification of these changed records?


Question 16

Question 16 Image 1

A table is registered with the following code. Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?


Question 17

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB. Which of the following likely explains these smaller file sizes?


Question 18

Which statement regarding stream-static joins and static Delta tables is correct?


Question 19

Question 19 Image 1

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device. Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:
Choose the response that correctly fills in the blank within the code block to complete this task.
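
The construct being tested is a non-overlapping (tumbling) window, which in PySpark would typically look like groupBy(window("event_time", "5 minutes"), "device_id") with avg aggregates. The bucketing semantics can be sketched in plain Python; the event rows below are hypothetical.

```python
from datetime import datetime

def window_start(ts, minutes=5):
    # Floor a timestamp to the start of its non-overlapping 5-minute bucket.
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

events = [  # hypothetical (device_id, event_time, temp) rows
    (1, datetime(2024, 1, 1, 12, 1), 20.0),
    (1, datetime(2024, 1, 1, 12, 4), 22.0),
    (1, datetime(2024, 1, 1, 12, 6), 30.0),
]

# Group each event into the bucket its timestamp falls in, then average.
buckets = {}
for device_id, ts, temp in events:
    buckets.setdefault((device_id, window_start(ts)), []).append(temp)
avg_temp = {key: sum(vals) / len(vals) for key, vals in buckets.items()}
print(avg_temp)
```

Because the intervals are non-overlapping, every event lands in exactly one bucket; a sliding window (with a slide shorter than the window) would instead place events in multiple buckets.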


Question 20

Question 20 Image 1

A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams. The proposed directory structure is displayed below. Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?


Question 21

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds. Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?


Question 22

Which statement describes Delta Lake Auto Compaction?


Question 23

Which statement characterizes the general programming model used by Spark Structured Streaming?


Question 24

Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?
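
The setting in question is spark.sql.files.maxPartitionBytes (128 MB by default), which caps how much file data is packed into each input partition at read time. A rough plain-Python sketch of the arithmetic, which deliberately ignores spark.sql.files.openCostInBytes and file boundaries:

```python
import math

def input_partitions(total_bytes, max_partition_bytes=128 * 1024 * 1024):
    # Rough partition count for a scan: input size / maxPartitionBytes.
    # Real Spark also folds in openCostInBytes and per-file splits.
    return math.ceil(total_bytes / max_partition_bytes)

one_gb = 1024 ** 3
print(input_partitions(one_gb))                     # 8 at the 128 MB default
print(input_partitions(one_gb, 256 * 1024 * 1024))  # 4 with a larger cap
```

Raising the cap yields fewer, larger partitions on ingest; lowering it yields more, smaller ones.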


Question 25

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum.Which situation is causing increased duration of the overall job?
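
The min-equals-median but outsized-max pattern is the classic signature of data skew: most tasks receive similar partition sizes while one straggler processes far more data. A plain-Python sketch of reading those summary statistics (task durations hypothetical):

```python
import statistics

# Hypothetical task durations (seconds) for one stage: 99 uniform tasks
# plus one straggler working through a heavily skewed partition.
durations = [2.0] * 99 + [200.0]

lo, med, hi = min(durations), statistics.median(durations), max(durations)
print(lo, med, hi)  # min and median agree; max is ~100x the min

# Min roughly equal to median with an outsized max points at skew in the
# data, not uniformly slow hardware (which would shift every task).
is_skewed = hi > 10 * med and med < 2 * lo
print(is_skewed)
```

Because a stage only finishes when its slowest task does, that single straggler dominates the stage's wall-clock time.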


Question 26

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores, and only one Executor per VM. Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?


Question 27

Question 27 Image 1

A junior data engineer on your team has implemented the following code block. The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table. When this query is executed, what will happen with new records that have the same event_id as an existing record?


Question 28

Question 28 Image 1

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job. Which statement describes the execution and results of running the above query multiple times?


Question 29

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. The field was also missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months. Which describes how Delta Lake can help to avoid data loss of this nature in the future?


Question 30

Question 30 Image 1

A nightly job ingests data into a Delta Lake table using the following code. The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline. Which code snippet completes this function definition?

def new_records():


Question 31

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?


Question 32

Question 32 Image 1

The data engineering team maintains the following code. Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?


Question 33

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels. The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams. Which statement exemplifies best practices for implementing this system?


Question 34

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.Which approach will ensure that this requirement is met?


Question 35

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries. The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added. Which solution addresses the situation while minimally interrupting other teams in the organization and without increasing the number of tables that need to be managed?


Question 36

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 & longitude > -20

Which statement describes how data will be filtered?


Question 37

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States. The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed. Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?


Question 38

Question 38 Image 1

The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes. A junior engineer has written the following code to add CHECK constraints to the Delta Lake table. A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed. Which statement explains the cause of this failure?


Question 39

Which of the following is true of Delta Lake and the Lakehouse?


Question 40

Question 40 Image 1

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table. The following logic is used to process these records. Which statement describes this implementation?


Question 41

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic. What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?


Question 42

Question 42 Image 1

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs. The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed. An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?


Question 43

Question 43 Image 1

The data governance team has instituted a requirement that all tables containing Personal Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true. The following SQL DDL statement is executed to create a new table. Which command allows manual confirmation that these three requirements have been met?


Question 44

Question 44 Image 1

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users. Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?


Question 45

Question 45 Image 1

Question 45 Image 2

An external object storage container has been mounted to the location /mnt/finance_eda_bucket. The following logic was executed to create a database for the finance team. After the database was successfully created and permissions configured, a member of the finance team runs the following code. If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?


Question 46

Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful about which credentials are stored there and which users have access to those secrets. Which statement describes a limitation of Databricks Secrets?


Question 47

Which statement is true regarding the retention of job run history?


Question 48

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens. Which statement describes the contents of the workspace audit logs concerning these events?


Question 49

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively. Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?


Question 50

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor. When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?


Question 51

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?


Question 52

Question 52 Image 1

Review the following error traceback. Which statement describes the error being raised?


Question 53

Which distribution does Databricks support for installing custom Python code packages?


Question 54

Which Python variable contains a list of directories to be searched when trying to locate required modules?


Question 55

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or potentially significant refactoring of existing code. Which statement describes a main benefit that offsets this additional effort?


Question 56

Which statement describes integration testing?


Question 57

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?


Question 58

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A. If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?
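
The dependency graph described (A → {B, C}) can be simulated in plain Python to make the run semantics concrete: a task with a failed dependency would be skipped, while tasks that already succeeded keep their results. The per-task outcomes below are hypothetical stand-ins for the scheduled run in the question.

```python
# Plain-Python sketch of the job's dependency graph and one run in which
# A and B succeed but C fails.
dependencies = {"A": [], "B": ["A"], "C": ["A"]}
outcomes = {"A": "success", "B": "success", "C": "failed"}  # hypothetical

def run_state(dependencies, outcomes):
    states = {}
    for task in dependencies:  # insertion order: A before B and C
        # A task only runs if every dependency succeeded; otherwise it
        # is skipped and its own outcome is never attempted.
        if any(states.get(dep) != "success" for dep in dependencies[task]):
            states[task] = "skipped"
        else:
            states[task] = outcomes[task]
    return states

states = run_state(dependencies, outcomes)
print(states)
print("failed" in states.values())  # any failed task marks the run failed
```

Note that B's successful completion is independent of C: nothing C does retroactively undoes work B already committed.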


Question 59

Question 59 Image 1

A Delta Lake table was created with the below query. Realizing that the original query had a typographical error, the below code was executed:

ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store

Which result will occur after running the second command?


Question 60

Question 60 Image 1

The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods, including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows. The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?


Question 61

Question 61 Image 1

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented. Which command should be removed from the notebook before scheduling it as a job?


Question 62

The business reporting team requires that data for their dashboards be updated every hour. The total processing time for the pipeline that extracts, transforms, and loads the data for their dashboards is 10 minutes. Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?


Question 63

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table

Which of the following describes how results are generated each time the dashboard is updated?


Question 64

Question 64 Image 1

A Delta Lake table was created with the below query. Consider the following query:

DROP TABLE prod.sales_by_store

If this statement is executed by a workspace admin, which result will occur?


Question 65

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount(). Which of the following statements is correct?


Question 66

Question 66 Image 1

The following code has been migrated to a Databricks notebook from a legacy workload. The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data. Which statement is a possible explanation for this behavior?


Question 67

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field. A junior data engineer suggests converting this data to Delta Lake will improve query performance. Which response to the junior data engineer's suggestion is correct?


Question 68

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?


Question 69

Question 69 Image 1

Question 69 Image 2

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema. For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields. Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization. Which solution meets the expectations of the end users while controlling and limiting possible costs?


Question 70

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.Which strategy will yield the best performance without shuffling data?
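
Whatever mechanism is chosen, the target part-file count follows from simple arithmetic: dataset size divided by target file size. A plain-Python sketch:

```python
import math

def target_part_files(dataset_bytes, target_file_bytes):
    # Number of output files needed to hit a target part-file size,
    # assuming output size roughly tracks input size.
    return math.ceil(dataset_bytes / target_file_bytes)

one_tb = 1024 ** 4
target = 512 * 1024 ** 2
print(target_part_files(one_tb, target))  # 2048 part files of ~512 MB
```

Since a shuffle must be avoided, the usual lever is the read-side partition size (for example spark.sql.files.maxPartitionBytes), because each write task emits one file per partition; in practice compression and format differences between JSON and Parquet mean the resulting file sizes are only approximate.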


Question 71

Question 71 Image 1

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data. Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:
Choose the response that correctly fills in the blank within the code block to complete this task.


Question 72

Question 72 Image 1

Question 72 Image 2

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows; note that the proposed changes are in bold.

Original query:

Proposed query: .start("/item_agg")

Which step must also be completed to put the proposed query into production?


Question 73

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution. Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?


Question 74

Which statement describes the correct use of pyspark.sql.functions.broadcast?


Question 75

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?


Question 76

A data pipeline uses Structured Streaming to ingest data from Apache Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day. A senior data engineer updates the Delta Table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use these additional metadata fields to diagnose the transient processing delays. Which limitation will the team face while diagnosing this problem?


Question 77

Question 77 Image 1

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected. The function is displayed below with a blank. Which response correctly fills in the blank to meet the specified requirements?


Question 78

Question 78 Image 1

The data engineering team maintains the following code: Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?


Question 79

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables. Which approach will ensure that this requirement is met?


Question 80

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org. Which of the following solutions addresses the situation while emphasizing simplicity?


Question 81

Question 81 Image 1

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic: A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67. Which statement describes the outcome of this batch insert?


Question 82

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task. Which statement explains what is preventing this privilege transfer?


Question 83

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely. Which of the following solutions meets the requirements?


Question 84

The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views. The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.
GRANT USAGE ON DATABASE prod TO eng;
GRANT SELECT ON DATABASE prod TO eng;
Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?


Question 85

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries. In which location can one review the timeline for cluster resizing events?


Question 86

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?


Question 87

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?


Question 88

You are testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.
assert myIntegrate(lambda x: x*x, 0, 3)[0] == 9
Which kind of test would the above line exemplify?
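The assertion above checks a single function's output against a known value in isolation, which is the defining trait of a unit test. A self-contained sketch, assuming `myIntegrate` is a hypothetical midpoint-rule implementation that returns a `(value, error_estimate)` tuple (hence the `[0]` indexing in the question):

```python
def myIntegrate(f, a, b, n=10_000):
    """Approximate the area under f on [a, b] with the composite
    midpoint rule. Returns (value, error_estimate) so callers take
    element [0] for the integral value."""
    h = (b - a) / n
    total = sum(f(a + (i + 0.5) * h) for i in range(n)) * h
    return (total, h * h)  # crude error estimate for illustration

# Unit test: the integral of x^2 from 0 to 3 is 27/3 = 9
assert round(myIntegrate(lambda x: x * x, 0, 3)[0]) == 9
```

The exact implementation is an assumption; the point is that the assert exercises one function's contract with fixed inputs and a known expected result, with no external systems involved.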


Question 89

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A. If task A fails during a scheduled run, which statement describes the results of this run?


Question 90

Which statement regarding Spark configuration on the Databricks platform is true?


Question 91

A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace. Which approach allows this user to share their code updates without the risk of overwriting the work of their teammates?


Question 92

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both DEEP and SHALLOW CLONE, development tables are created using SHALLOW CLONE. A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that VACUUM was run the day before. Which statement describes why the cloned tables are no longer working?


Question 93

You are performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF. Which code block attempts to perform an invalid stream-static join?


Question 94

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators. Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?


Question 95

Question 95 Image 1

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed: Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?
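The trap in this scenario is that de-duplication scoped to a single batch cannot catch a duplicate that arrives in a later hourly batch. A pure-Python sketch of that failure mode (lists standing in for the Delta table; names hypothetical):

```python
def append_with_batch_dedup(target, batch, key_fields):
    """Within-batch de-duplication on the composite key, then a blind
    append -- no comparison against records already in the target."""
    seen = set()
    for row in batch:
        k = tuple(row[f] for f in key_fields)
        if k not in seen:
            seen.add(k)
            target.append(row)
    return target

target = []
# hour 1: order (1, 100) is queued
batch_1 = [{"customer_id": 1, "order_id": 100, "time": "09:05"}]
# hour 5: the same order is re-queued by the upstream system
batch_2 = [{"customer_id": 1, "order_id": 100, "time": "13:41"}]
append_with_batch_dedup(target, batch_1, ("customer_id", "order_id"))
append_with_batch_dedup(target, batch_2, ("customer_id", "order_id"))
# the duplicate survives: each hourly run only de-duplicates its own batch
```

Catching cross-batch duplicates requires comparing incoming keys against the target table itself (for example via an insert-only merge on the composite key), not just within the incoming batch.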


Question 96

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write. Which consideration will impact the decisions made by the engineer while migrating this workload?


Question 97

A data architect has heard about Delta Lake’s built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table. The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability. Which piece of information is critical to this decision?


Question 98

Question 98 Image 1

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs. The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed: An analyst who is not a member of the auditing group executes the following query:
SELECT * FROM user_ltv_no_minors
Which statement describes the results returned by this query?


Question 99

Question 99 Image 1

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table. Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?
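The propagation step itself is straightforward: remove every aggregate row whose user_id belongs to a deleted user. A pure-Python sketch of that semantics (names hypothetical; lists stand in for the Delta tables):

```python
def propagate_deletes(aggregates, deleted_user_ids):
    """Keep only aggregate rows for users NOT in the delete set --
    the logical effect of a MERGE ... WHEN MATCHED THEN DELETE
    driven by the set of users removed from user_lookup."""
    return [row for row in aggregates if row["user_id"] not in deleted_user_ids]

user_aggregates = [{"user_id": 1, "total": 10}, {"user_id": 2, "total": 7}]
current = propagate_deletes(user_aggregates, {2})  # user 2 requested deletion
```

The subtlety the question is probing is not this logic but Delta Lake's physical behavior: a successful delete only rewrites the current table version, and the removed rows remain reachable through time travel until the old data files are vacuumed.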


Question 100

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company. A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users. Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to using these credentials?


Question 101

Which indicators would you look for in the Spark UI’s Storage tab to signal that a cached table is not performing optimally? Assume you are using Spark’s MEMORY_ONLY storage level.


Question 102

What is the first line of a Databricks Python notebook when viewed in a text editor?


Question 103

Which statement describes a key benefit of an end-to-end test?


Question 104

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response that the job run request has been submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?


Question 105

Question 105 Image 1

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE. The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model. Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?


Question 106

A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed as they await moderator approval. Each entry represents a user review of a product and has the following schema:
user_id STRING, review_id BIGINT, product_id BIGINT, review_timestamp TIMESTAMP, review_text STRING
The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched. Which solution minimizes the compute costs to propagate this batch of data?


Question 107

Which statement describes Delta Lake optimized writes?


Question 108

Which statement describes the default execution mode for Databricks Auto Loader?


Question 109

A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?


Question 110

A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data. Which of the following solutions would you implement to achieve this requirement?


Question 111

Which describes a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?


Question 112

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores, and only one Executor per VM. Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures?


Question 113

A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table. Given the current implementation, which method can be used?
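Because the table is fully overwritten each night, "the difference" means comparing the current table version against the previous one (which Delta exposes via time travel, e.g. `VERSION AS OF`). A pure-Python sketch of what such a version-to-version diff computes (lists of dicts standing in for the two snapshots; names hypothetical):

```python
def snapshot_diff(previous, current, key):
    """Compare two table snapshots keyed by `key`, returning the keys
    that were added, removed, or changed between the versions -- a rough
    analogue of diffing version N against version N-1 via time travel."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    added = sorted(curr.keys() - prev.keys())
    removed = sorted(prev.keys() - curr.keys())
    changed = sorted(k for k in curr.keys() & prev.keys() if curr[k] != prev[k])
    return added, removed, changed

v0 = [{"id": 1, "churn": 0.2}, {"id": 2, "churn": 0.5}]  # previous version
v1 = [{"id": 2, "churn": 0.9}, {"id": 3, "churn": 0.1}]  # current version
diff = snapshot_diff(v0, v1, "id")  # ([3], [1], [2])
```

In Spark SQL the equivalent comparison would join or anti-join `customer_churn_params VERSION AS OF <n-1>` against the current table; the sketch only illustrates what information that comparison yields.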


Question 114

Question 114 Image 1

Question 114 Image 2

A data team’s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.
Original query:
Proposed query:
Which step must also be completed to put the proposed query into production?


Question 115

When using the CLI or REST API to get results from jobs with multiple tasks, which statement correctly describes the response structure?


Question 116

The data engineering team is configuring environments for development, testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible. A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team. Which statement captures best practices for this situation?


Question 117

A data engineer, User A, has promoted a pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens. A workspace admin, User C, inherits responsibility for managing this pipeline. User C uses the Databricks Jobs UI to take "Owner" privileges of each job. Jobs continue to be triggered using the credentials and tooling configured by User B. An application has been configured to collect and parse run information returned by the REST API. Which statement describes the value returned in the creator_user_name field?


Question 118

Question 118 Image 1

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented. Which command should be removed from the notebook before scheduling it as a job?


Question 119

Which statement regarding Spark configuration on the Databricks platform is true?


Question 120

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards completes in 10 minutes. Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?


Question 121

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:
SELECT COUNT(*) FROM table
Which of the following describes how results are generated each time the dashboard is updated?


Question 122

Question 122 Image 1

A Delta Lake table was created with the below query: Consider the following query:
DROP TABLE prod.sales_by_store
If this statement is executed by a workspace admin, which result will occur?


Question 123

A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace. Which approach allows this user to share their code updates without the risk of overwriting the work of their teammates?


Question 124

Question 124 Image 1

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database. After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged). Which statement describes what will happen when the above code is executed?


Question 125

Question 125 Image 1

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE. The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model. Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?


Question 126

Question 126 Image 1

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table. Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales. What will be the outcome of executing these command cells in order in an interactive notebook?


Question 127

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the below schema:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field. A junior data engineer suggests converting this data to Delta Lake will improve query performance. Which response to the junior data engineer’s suggestion is correct?


Question 128

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings. The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization. The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data. Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?


Question 129

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?


Question 130

Question 130 Image 1

Question 130 Image 2

Question 130 Image 3

The following table consists of items found in user carts within an e-commerce website. The following MERGE statement is used to update this table using an updates view, with schema evolution enabled on this table. How would the following update be handled?


Question 131

A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to an ELT job. The ELT job has a corresponding Databricks SQL query that returns the number of input records containing unexpected NULL values. The data engineer wants their entire team to be notified via a messaging webhook whenever this value reaches 100. Which approach can the data engineer use to notify their entire team via a messaging webhook whenever the number of NULL values reaches 100?


Question 132

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:
user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id. Which implementation can be used to efficiently update the described account_current table as part of each hourly batch job, assuming there are millions of user accounts and tens of thousands of records processed hourly?
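A Type 1 table keeps exactly one row per key, with the newest record winning; in Delta Lake that is typically a MERGE on user_id. A pure-Python sketch of that upsert semantics (dicts standing in for the table; names hypothetical, not the exam's answer):

```python
def type1_upsert(current, updates, key="user_id", ts="last_updated"):
    """MERGE-style Type 1 upsert: for each key, the record with the
    most recent timestamp wins; unseen keys are inserted."""
    state = {row[key]: row for row in current}
    for row in updates:
        k = row[key]
        if k not in state or row[ts] >= state[k][ts]:
            state[k] = row  # update matched row / insert new row
    return sorted(state.values(), key=lambda r: r[key])

current = [{"user_id": 1, "username": "ana", "last_updated": 100}]
updates = [
    {"user_id": 1, "username": "ana_v2", "last_updated": 200},  # newer value
    {"user_id": 2, "username": "bo", "last_updated": 150},      # new account
]
result = type1_upsert(current, updates)
# result holds one row per user_id, with user 1 updated to "ana_v2"
```

The efficiency angle the question raises (millions of rows, tens of thousands of hourly updates) is about doing this as an incremental merge of just the new batch rather than rebuilding the whole table; the sketch shows only the per-key "latest wins" logic.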


Question 133

Question 133 Image 1

Question 133 Image 2

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema: For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields: Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization. Which solution meets the expectations of the end users while controlling and limiting possible costs?


Question 134

A Delta Lake table with CDF enabled in the Lakehouse, named customer_churn_params, is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours. Which approach would simplify the identification of these changed records?


Question 135

Question 135 Image 1

A view is registered with the following code: Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?


Question 136

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used. Which strategy will yield the best performance without shuffling data?
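The sizing arithmetic behind this question is worth making explicit: 1 TB split into 512 MB part files means 2,048 output files, and each Spark partition writes one part file. A back-of-envelope sketch (assuming, for simplicity, a roughly 1:1 ratio between in-memory partition size and on-disk file size, which real compression will change):

```python
# Back-of-envelope sizing for the 512 MB part-file target.
dataset_bytes = 1 * 1024**4        # 1 TB of source data
target_part_bytes = 512 * 1024**2  # desired ~512 MB per part file
num_part_files = dataset_bytes // target_part_bytes  # 2048

# Since each write partition becomes one part file, the goal is ~2048
# partitions at write time. A repartition() would achieve that count but
# introduces a shuffle; controlling input partition size at read time
# (e.g. via spark.sql.files.maxPartitionBytes set near 512 MB) shapes
# the partitions without shuffling, which is what the question asks for.
```

The exact read-side configuration is the lever commonly cited for this scenario; the arithmetic above is what any such answer has to satisfy.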


Question 137

Identify how the count_if function and count where x is null can be used. Consider a table random_values with the below data:
col1
0
1
2
NULL
-2
3
What would be the output of the below query?
SELECT count_if(col1 > 1) AS count_a, count(*) AS count_b, count(col1) AS count_c FROM random_values
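Assuming the table holds the six values 0, 1, 2, NULL, -2 and 3 in a single column, the three SQL aggregates can be sanity-checked in plain Python, with `None` standing in for SQL NULL (NULLs fail the `count_if` predicate and are skipped by `count(col1)`, while `count(*)` counts every row):

```python
col1 = [0, 1, 2, None, -2, 3]  # None plays the role of SQL NULL

# count_if(col1 > 1): rows where the predicate is true (NULL is not true)
count_a = sum(1 for v in col1 if v is not None and v > 1)
# count(*): counts all rows, NULL or not
count_b = len(col1)
# count(col1): counts only non-NULL values
count_c = sum(1 for v in col1 if v is not None)
```

Only 2 and 3 satisfy `> 1`, there are six rows in total, and five of them are non-NULL.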


Question 138

Question 138 Image 1

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
Streaming DataFrame df has the following schema:
device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT
Code block: Which line of code correctly fills in the blank within the code block to complete this task?
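"Non-overlapping five-minute intervals" are tumbling windows: each event falls into exactly one bucket whose start is its timestamp floored to a multiple of five minutes (in Structured Streaming this is expressed with the `window()` function over `event_time`). A pure-Python simulation of that bucketing (names and sample events hypothetical):

```python
from datetime import datetime

def five_min_bucket(ts):
    """Floor a timestamp to the start of its non-overlapping
    five-minute window, mirroring window(event_time, '5 minutes')."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

events = [
    {"device_id": 1, "event_time": datetime(2024, 1, 1, 10, 1), "temp": 20.0},
    {"device_id": 1, "event_time": datetime(2024, 1, 1, 10, 4), "temp": 22.0},
    {"device_id": 1, "event_time": datetime(2024, 1, 1, 10, 6), "temp": 30.0},
]

# group by (device_id, window start), then average within each group
buckets = {}
for e in events:
    key = (e["device_id"], five_min_bucket(e["event_time"]))
    buckets.setdefault(key, []).append(e["temp"])
averages = {k: sum(v) / len(v) for k, v in buckets.items()}
# 10:01 and 10:04 share the 10:00 window; 10:06 starts the 10:05 window
```

The streaming query performs the same grouping declaratively; the point of the sketch is that the windows tile the timeline without overlap, unlike sliding windows.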


Question 139

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3 seconds; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution. Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?


Question 140

Which statement describes Delta Lake optimized writes?


Question 141

Which statement characterizes the general programming model used by Spark Structured Streaming?


Question 142

Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?


Question 143

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum. Which situation is causing increased duration of the overall job?


Question 144

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores, and only one Executor per VM. Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures?


Question 145

Question 145 Image 1

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed: Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?


Question 146

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records. In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?


Question 147

Question 147 Image 1

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job: Which statement describes the execution and results of running the above query multiple times?


Question 148

A DLT pipeline includes the following streaming tables:
• raw_iot ingests raw device measurement data from a heart rate tracking device.
• bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.
How can the data engineer configure this pipeline to be able to retain manually deleted or updated records in the raw_iot table, while recomputing the downstream bpm_stats table when a pipeline update is run?


Question 149

A data pipeline uses Structured Streaming to ingest data from Apache Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day. A senior data engineer updates the Delta Table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use these additional metadata fields to diagnose the transient processing delays. Which limitation will the team face while diagnosing this problem?


Question 150

Question 150 Image 1

A nightly job ingests data into a Delta Lake table using the following code: The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline. Which code snippet completes this function definition?
def new_records():


Question 151

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure. The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications. The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields. Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?


Question 152

Question 152 Image 1

The data engineering team maintains the following code: Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?


Question 153

The data engineering team is configuring environments for development, testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible. A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team. Which statement captures best practices for this situation?


Question 154

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables. Which approach will ensure that this requirement is met?


Question 155

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org. Which of the following solutions addresses the situation while emphasizing simplicity?


Question 156

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 & longitude > -20

Which statement describes how data will be filtered?
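Because longitude is not the partitioning column, any pruning for this filter must rely on the per-file column statistics (min/max values) that Delta Lake records in its transaction log. The idea can be sketched in plain Python with hypothetical file names and statistics; this is an illustration of the concept, not the actual Delta implementation:

```python
# Hypothetical per-file column statistics, similar in spirit to the
# min/max values Delta Lake records in its transaction log.
file_stats = {
    "part-000.parquet": {"min": -75.0, "max": -30.0},
    "part-001.parquet": {"min": -25.0, "max": 15.0},
    "part-002.parquet": {"min": 30.0, "max": 90.0},
}

def files_to_scan(stats, lo, hi):
    """Keep only files whose [min, max] longitude range overlaps (lo, hi)."""
    return [name for name, s in stats.items() if s["max"] > lo and s["min"] < hi]

# Filter: longitude < 20 & longitude > -20
print(files_to_scan(file_stats, -20.0, 20.0))  # -> ['part-001.parquet']
```

A file whose recorded longitude range falls entirely outside (-20, 20) can be skipped without being read at all.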


Question 157

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States. The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed. Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?


Question 158

Question 158 Image 1

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic. A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67. Which statement describes the outcome of this batch insert?


Question 159

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write. Which consideration will impact the decisions made by the engineer while migrating this workload?


Question 160

A data architect has heard about Delta Lake’s built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table. The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability. Which piece of information is critical to this decision?


Question 161

Question 161 Image 1

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impressions led to monetizable clicks. In the code below, Impressions is a streaming DataFrame with a watermark ("event_time", "10 minutes"). The data engineer notices the query slowing down significantly. Which solution would improve the performance?


Question 162

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task. Which statement explains what is preventing this privilege transfer?


Question 163

Question 163 Image 1

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs. The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed. An analyst who is not a member of the auditing group executes the following query:

SELECT * FROM user_ltv_no_minors

Which statement describes the results returned by this query?


Question 164

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personal Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to retain records containing PII in this table for only 14 days after initial ingestion, while retaining non-PII records indefinitely. Which solution meets the requirements?
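One way to reason about this requirement is that routing records by the topic field lets PII and non-PII data live in separate tables with independent access controls and retention settings. A toy sketch with made-up records (lists standing in for the two tables):

```python
# Toy rows mirroring the ingest schema's topic field.
records = [
    {"topic": "registration", "offset": 1},  # contains PII
    {"topic": "clicks", "offset": 2},
    {"topic": "registration", "offset": 3},  # contains PII
    {"topic": "orders", "offset": 4},
]

PII_TOPICS = {"registration"}

# Route PII and non-PII records to separate collections, standing in for
# separate tables with independent ACLs and retention policies.
pii = [r for r in records if r["topic"] in PII_TOPICS]
non_pii = [r for r in records if r["topic"] not in PII_TOPICS]

print(len(pii), len(non_pii))  # -> 2 2
```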


Question 165

Question 165 Image 1

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table. Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?


Question 166

Question 166 Image 1

Question 166 Image 2

An external object storage container has been mounted to the location /mnt/finance_eda_bucket. The following logic was executed to create a database for the finance team. After the database was successfully created and permissions configured, a member of the finance team runs the following code. If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?


Question 167

A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run. Which of the following tools can the data engineer use to solve this problem?


Question 168

What is the retention of job run history?


Question 169

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens. Which statement describes the contents of the workspace audit logs concerning these events?


Question 170

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries. In which location can one review the timeline for cluster resizing events?


Question 171

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?


Question 172

The data engineer is using Spark's MEMORY_ONLY storage level. Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?


Question 173

Question 173 Image 1

Review the following error traceback. Which statement describes the error being raised?


Question 174

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?


Question 175

A Databricks single-task workflow fails at the last task due to an error in a notebook. The data engineer fixes the mistake in the notebook. What should the data engineer do to rerun the workflow?


Question 176

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code. Which benefit offsets this additional effort?


Question 177

What describes integration testing?


Question 178

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response that the job run request has been submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?


Question 179

A Databricks job has been configured with three tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A. What will be the resulting state if tasks A and B complete successfully but task C fails during a scheduled run?


Question 180

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?


Question 181

Question 181 Image 1

A Delta Lake table was created with the below query. Realizing that the original query had a typographical error, the below code was executed:

ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store

Which result will occur after running the second command?


Question 182

Question 182 Image 1

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings. The below query is used to create the alert. The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute. If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?


Question 183

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown. Which approach will allow this developer to review the current logic for this notebook?


Question 184

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount(). Which of the following statements is correct?


Question 185

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?
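For reference, the load path in the question is an f-string, so the variable is substituted with braces ({date}). In a scheduled notebook the parameter value would typically arrive via dbutils.widgets.get("date"); a minimal sketch with a plain variable standing in for that widget call:

```python
# In a Databricks notebook the value would normally come from a job
# parameter via dbutils.widgets.get("date"); a plain variable stands in
# for it here so the sketch is self-contained.
date = "2024-01-01"

# f-string substitution requires braces around the variable name.
path = f"/mnt/source/{date}"
print(path)  # -> /mnt/source/2024-01-01
```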


Question 186

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day. Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?


Question 187

Question 187 Image 1

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE". The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day. Which code block accomplishes this task while minimizing potential compute costs?


Question 188

Question 188 Image 1

The following code has been migrated to a Databricks notebook from a legacy workload. The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data. Which statement is a possible explanation for this behavior?


Question 189

A Delta table of weather records is partitioned by date and has the below schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?


Question 190

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both DEEP and SHALLOW CLONE, development tables are created using SHALLOW CLONE. A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that VACUUM was run the day before. Which statement describes why the cloned tables are no longer working?


Question 191

Question 191 Image 1

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create. Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?


Question 192

A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table. Given the current implementation, which method can be used?


Question 193

Question 193 Image 1

A view is registered with the following code. Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?


Question 194

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF. Which code block attempts to perform an invalid stream-static join?


Question 195

Question 195 Image 1

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data. Streaming DataFrame df has the following schema:

device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block: Choose the response that correctly fills in the blank within the code block to complete this task.
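In Structured Streaming this aggregation would use groupBy(window("event_time", "5 minutes")) together with a 10-minute watermark. The bucketing logic itself can be illustrated in plain Python with toy readings (timestamps in seconds; a sketch of the concept, not the streaming API):

```python
from collections import defaultdict

# Toy readings: (event_time in seconds, temp, humidity).
events = [
    (10, 20.0, 0.50),
    (70, 22.0, 0.60),
    (310, 30.0, 0.40),  # lands in the second five-minute window
]

WINDOW = 300  # five minutes, in seconds

# Assign each reading to a non-overlapping five-minute bucket.
buckets = defaultdict(list)
for ts, temp, humidity in events:
    buckets[ts // WINDOW].append((temp, humidity))

# Average temp and humidity per window, as the grouped aggregation would.
averages = {
    w: (sum(t for t, _ in v) / len(v), sum(h for _, h in v) / len(v))
    for w, v in buckets.items()
}
print(averages)
```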


Question 196

Question 196 Image 1

A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams. The proposed directory structure is displayed below. Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?


Question 197

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds. Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?


Question 198

Which statement describes the default execution mode for Databricks Auto Loader?


Question 199

Which statement describes the correct use of pyspark.sql.functions.broadcast?


Question 200

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators. Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?


Question 201

Question 201 Image 1

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable. Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order. If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
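Within a single batch, duplicates on a composite key can be removed with dropDuplicates(["customer_id", "order_id"]); the same seen-set idea expressed in plain Python with toy rows:

```python
# Toy orders; (customer_id, order_id) is the composite key, and the source
# occasionally emits the same order again hours later.
orders = [
    {"customer_id": 1, "order_id": "A", "hour": 0},
    {"customer_id": 2, "order_id": "B", "hour": 1},
    {"customer_id": 1, "order_id": "A", "hour": 5},  # duplicate entry
]

# Keep the first row seen for each composite key.
seen, deduped = set(), []
for row in orders:
    key = (row["customer_id"], row["order_id"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

print(len(deduped))  # -> 2
```

Note that this only removes duplicates within one batch; duplicates arriving in different nightly runs additionally require an insert-only merge (or an equivalent check) against the target table.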


Question 202

Question 202 Image 1

A junior data engineer on your team has implemented the following code block. The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table. When this query is executed, what will happen to new records that have the same event_id as an existing record?


Question 203

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months. Which describes how Delta Lake can help to avoid data loss of this nature in the future?


Question 204

Question 204 Image 1

The data engineering team maintains the following code. Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?


Question 205

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels. The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams. Which statement exemplifies best practices for implementing this system?


Question 206

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables. Which approach will ensure that this requirement is met?


Question 207

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries. The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added. Which of the following solutions addresses the situation while minimally interrupting other teams in the organization, without increasing the number of tables that need to be managed?


Question 208

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta table?


Question 209

Question 209 Image 1

The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes. A junior engineer has written the following code to add CHECK constraints to the Delta Lake table. A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed. Which statement explains the cause of this failure?


Question 210

What is true for Delta Lake?


Question 211

Question 211 Image 1

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table. The following logic is used to process these records. Which statement describes this implementation?


Question 212

A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks. One member of the team suggests reusing these data quality rules across all tables defined for this pipeline. What approach would allow them to do this?


Question 213

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic. What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?


Question 214

Question 214 Image 1

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs. The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed. An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?


Question 215

Question 215 Image 1

The data governance team has instituted a requirement that all tables containing Personal Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true. The following SQL DDL statement is executed to create a new table. Which command allows manual confirmation that these three requirements have been met?


Question 216

Question 216 Image 1

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users. Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?


Question 217

The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views. The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group:

GRANT USAGE ON DATABASE prod TO eng;
GRANT SELECT ON DATABASE prod TO eng;

Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?


Question 218

Question 218 Image 1

A user wants to use DLT expectations to validate that a derived table report contains all records from the source, included in the table validation_copy. The user attempts and fails to accomplish this by adding an expectation to the report table definition. Which approach would allow using DLT expectations to validate all expected records are present in this table?


Question 219

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively. Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?
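For a repeatable timing measurement, the standard library's timeit module averages wall-clock time over many executions instead of relying on individual interactive display() calls. A sketch with a hypothetical transform function standing in for the pipeline logic:

```python
import timeit

# A hypothetical transformation standing in for the notebook logic.
def transform(data):
    return [x * 2 for x in data if x % 3 == 0]

data = list(range(10_000))

# Repeat the full operation several times and average the results,
# rather than eyeballing individual interactive runs.
runs = timeit.repeat(lambda: transform(data), number=10, repeat=3)
avg = sum(runs) / len(runs)
print(f"average batch time: {avg:.6f}s")
```

With Spark specifically, note that lazy evaluation means a realistic measurement should trigger the full write or action the production job performs, not just a preview of the first rows.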


Question 220

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?


Question 221

A data engineer needs to capture the pipeline settings from an existing pipeline in the workspace and use them to create and version a JSON file that defines a new pipeline. Which command should the data engineer enter in a web terminal configured with the Databricks CLI?


Question 222

Which Python variable contains a list of directories to be searched when trying to locate required modules?
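The variable in question is sys.path: the list of directories Python searches, in order, when resolving imports. A quick demonstration (the inserted directory is purely hypothetical):

```python
import sys

# sys.path holds the directories searched, in order, on import.
print(type(sys.path).__name__)  # -> list

# Prepending a directory makes modules located there importable first;
# the path below is a hypothetical example.
sys.path.insert(0, "/tmp/my_shared_modules")
print(sys.path[0])  # -> /tmp/my_shared_modules
```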


Question 223

None


Question 224

None


Question 225

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?


Question 226

A data engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production. How can the data engineer run unit tests against functions that work with data in production?


Question 227

Question 227 Image 1

Question 227 Image 2

A data engineer wants to refactor the following DLT code, which includes multiple table definitions with very similar code. In an attempt to programmatically create these tables using a parameterized table definition, the data engineer writes the following code. The pipeline runs an update with this refactored code, but generates a different DAG showing incorrect configuration values for these tables. How can the data engineer fix this?


Question 228

A data engineer has created a 'transactions' Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires the Apache Iceberg format. What should the data engineer do?


Question 229

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure. The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields. In total, 15 fields have been identified that will often be used for filter and join logic. The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema. Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?


Question 230

A platform engineer is creating catalogs and schemas for the development team to use. The engineer has created an initial catalog, Catalog_A, and an initial schema, Schema_A. The engineer has also granted USE CATALOG, USE SCHEMA, and CREATE TABLE to the development team so that the team can begin populating the schema with new tables. Despite being owner of the catalog and schema, the engineer noticed that they do not have access to the underlying tables in Schema_A. What explains the engineer's lack of access to the underlying tables?


Question 231

A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team access to view the driver logs if needed. What are the minimal cluster permissions that allow the development team to accomplish this?


Question 232

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers and one driver of type i3.xlarge and should use the '14.3.x-scala2.12' runtime. Which command should the data engineer use?


Question 233

A 'transactions' table has been liquid clustered on the columns 'product_id', 'user_id' and 'event_date'. Which operation lacks support for cluster on write?


Question 234

Question 234 Image 1

The data governance team has instituted a requirement that the "user" table containing Personal Identifiable Information (PII) must have the appropriate masking on the SSN column. This means that anyone outside of the HRAdminGroup should see social security numbers masked (e.g., ***-**-****). The team created a masking function. What does the data governance team need to do next to achieve this goal?


Question 235

A data engineer needs to create an application that will collect information about the latest job run, including the repair history. How should the data engineer format the request?


Question 236

A data engineer is working in an interactive notebook with many transformations before outputting the result from display(df.collect()). The notebook includes wide transformations and a cross join. The data engineer is getting the following error: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached." Which action should the data engineer take?


Question 237

An analytics team wants to run a short-term experiment on the customer transaction Delta table (with 20 billion records) created by the data engineering team in Databricks SQL. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?


Question 238

A data team is working to optimize an existing large, fast-growing table 'orders' with high-cardinality columns, which experiences significant data skew and requires frequent concurrent writes. The team notices that the columns 'user_id', 'event_timestamp' and 'product_id' are heavily used in analytical queries and filters, although those keys may be subject to change in the future due to different business requirements. Which partitioning strategy should the team choose to optimize the table for immediate data skipping, incremental management over time, and flexibility?